 video representation


HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model

Neural Information Processing Systems

Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning departs from the way humans naturally perceive scenes in the first-person perspective, leading to a lack of interpretable reasoning; and (2) learning is limited in capturing inherent fine-grained relationships between the two modalities. In this paper, we take inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationships for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understanding. This comprises three alignment types: video-narration, noun-entity, and verb-entities alignment. Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query. Project page: https://uark-aicv.github.io/HENASY


Unsupervised Learning of View-invariant Action Representations

Neural Information Processing Systems

Recent successes in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using a video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics that are discriminative for the action. In addition, we propose a view-adversarial training method to enhance learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.





Appendices

Neural Information Processing Systems

Note that ppos is task-specific; here we use the class oracle, i.e. the ImageNet-100 labels, to define the positive samples. In Figure 1, we plot the proxy task performance, i.e. the percentage of queries where the key is ranked over all negatives, across training for MoCo [19], MoCo-v2 [10] and some variants in between. As mentioned above, all results in Figure 1 are for the same τ = 0.2. Ablations showed that this yields at best performance as good as mixing with the query, but on average about 0.1-0.2% lower. This weighing scheme also resulted in slightly inferior results.
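
The proxy task metric described in this excerpt (the share of queries whose key is ranked over all negatives) can be illustrated with a minimal sketch. The function name, the input layout, and the choice to count ties as misses are assumptions made for illustration and are not taken from the paper. The function returns a fraction in [0, 1]; multiply by 100 for a percentage.

```python
def proxy_task_accuracy(pos_sims, neg_sims):
    """Fraction of queries whose positive key outscores every negative.

    pos_sims: list of floats, similarity of each query to its positive key.
    neg_sims: list of lists, similarities of each query to its negatives.
    Ties are counted as misses (an assumption of this sketch).
    """
    if len(pos_sims) != len(neg_sims):
        raise ValueError("need one list of negatives per query")
    if not pos_sims:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for pos, negs in zip(pos_sims, neg_sims)
        if all(pos > neg for neg in negs)
    )
    return hits / len(pos_sims)


# Hypothetical similarities for three queries with two negatives each.
accuracy = proxy_task_accuracy(
    [0.9, 0.2, 0.5],
    [[0.1, 0.3], [0.4, 0.1], [0.5, 0.2]],
)
```

Because dividing all similarities by a positive temperature τ does not change their ordering, this ranking-based measure can be computed on raw similarities.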




Cycle-Contrast for Self-Supervised Video Representation Learning

Neural Information Processing Systems

These methods give effective representations and decent results on downstream tasks; however, we suggest that utilizing other natural characteristics of video can lead to different yet representative video representations.


Self-supervised Co-training for Video Representation Learning

Neural Information Processing Systems

We show that the answer is no, in two respects: First, we show that hard positives are being neglected in the self-supervised training, and that if these hard positives are included then the quality of learnt representation improves significantly. To investigate this, we conduct an oracle experiment where positive samples are incorporated into the instance-based training process based on the semantic class label.
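
As a rough illustration of the oracle idea in this excerpt, the sketch below computes a contrastive loss for one query in which every key sharing the query's class label is treated as a positive. This is a generic multi-positive InfoNCE-style formulation written for illustration, not necessarily the exact loss used in the paper; the function name, argument layout, and the default temperature of 0.07 are assumptions.

```python
import math


def oracle_info_nce(query_sims, key_labels, query_label, temperature=0.07):
    """Contrastive loss for one query with class-label (oracle) positives.

    query_sims: similarity between the query and each key.
    key_labels: class label of each key (oracle information).
    query_label: class label of the query.
    Returns -log(sum of positive probabilities) under a softmax over all keys.
    """
    logits = [s / temperature for s in query_sims]
    positives = [l for l, y in zip(logits, key_labels) if y == query_label]
    if not positives:
        raise ValueError("query has no positive key")

    def log_sum_exp(values):
        # Numerically stable log(sum(exp(v))) over a non-empty list.
        peak = max(values)
        return peak + math.log(sum(math.exp(v - peak) for v in values))

    return log_sum_exp(logits) - log_sum_exp(positives)


# Hypothetical example: key 0 is the query's own augmentation,
# key 1 is a different instance that shares the query's class.
instance_only = oracle_info_nce([0.8, 0.6, 0.1], ["a", "b", "c"], "a")
with_oracle = oracle_info_nce([0.8, 0.6, 0.1], ["a", "a", "c"], "a")
```

In this toy setup the loss with oracle positives is lower, since the same-class key no longer has to be pushed away from the query as if it were a negative.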